INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing a high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can identify in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Exploratory Data Analysis (EDA)

Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
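Several of these questions reduce to simple counts and group-wise proportions. The sketch below shows one way to answer questions 1, 4 and 5 with pandas. The tiny DataFrame is made up for illustration; only the column names `arrival_month`, `repeated_guest` and `booking_status` come from this notebook, and the `"Canceled"` / `"Not_Canceled"` labels are assumed to match the encoding used later.

```python
import pandas as pd

# Hypothetical sample standing in for the real bookings data.
df = pd.DataFrame(
    {
        "arrival_month": [10, 10, 9, 8, 10, 9, 12, 10],
        "repeated_guest": [0, 0, 1, 0, 1, 0, 1, 0],
        "booking_status": [
            "Canceled", "Not_Canceled", "Not_Canceled", "Canceled",
            "Not_Canceled", "Not_Canceled", "Not_Canceled", "Canceled",
        ],
    }
)

# Boolean flag: True where the booking was canceled.
canceled = df["booking_status"].eq("Canceled")

# Q4: overall percentage of canceled bookings.
overall_cancel_pct = canceled.mean() * 100

# Q5: cancellation percentage split by repeated_guest (0 = new, 1 = repeating).
cancel_pct_by_repeat = canceled.groupby(df["repeated_guest"]).mean() * 100

# Q1: number of bookings per arrival month, and the busiest month.
bookings_per_month = df["arrival_month"].value_counts()
busiest_month = bookings_per_month.idxmax()

print(overall_cancel_pct, cancel_pct_by_repeat.to_dict(), busiest_month)
```

The same `groupby(...).mean()` pattern could be reused for other categorical columns, such as `market_segment_type`.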

1. What are the busiest months in the hotel?

Observations:

2. Which market segment do most of the guests come from?

Observations:

3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

Observations:

4. What percentage of bookings are canceled?

Observations:

5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Observations:

6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Observations:

Univariate Analysis

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Bivariate Analysis

Observations:

Observations:

Observations:

market_segment_type vs booking_status

Observations:

arrival_year vs booking_status

Observations:

repeated_guest vs booking_status

Observations:

no_of_previous_cancellations vs booking_status

Observations:

arrival_month vs arrival_year

Observations:

Observations:

Data Preprocessing

Spaces in column values are replaced with underscores ('_')
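A minimal sketch of this cleanup step, assuming the affected columns hold plain strings. The column name `type_of_meal_plan` and its values are hypothetical placeholders; the real notebook would apply the same call to whichever text columns contain spaces.

```python
import pandas as pd

# Hypothetical sample; column name and values are assumptions for illustration.
df = pd.DataFrame(
    {
        "type_of_meal_plan": ["Meal Plan 1", "Not Selected"],
        "market_segment_type": ["Online", "Offline"],
    }
)

# Text columns to clean (listed explicitly so numeric columns are untouched).
text_cols = ["type_of_meal_plan", "market_segment_type"]

for col in text_cols:
    # regex=False treats " " as a literal character rather than a pattern.
    df[col] = df[col].str.replace(" ", "_", regex=False)

print(df)
```

Values without spaces pass through unchanged, so it is safe to run the loop over every text column.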

Multicollinearity is present in the data, so a few more columns are dropped

Treating Outliers
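The exact outlier treatment used in this notebook is not shown here. One common option, sketched below as an assumption rather than the notebook's actual method, is to cap values at the IQR whiskers (1.5 times the interquartile range beyond the quartiles). The helper name `clip_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical numeric column with one extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 500], dtype=float)
clipped = clip_outliers_iqr(values)

print(clipped.tolist())
```

Capping keeps every row in the dataset, which matters when the extreme values belong to genuine bookings rather than data-entry errors.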

Data Preparation

Encoding Not_Canceled as 1 and Canceled as 0
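This encoding can be sketched with a dictionary and `Series.map`. The three-row DataFrame is a made-up stand-in for the bookings data.

```python
import pandas as pd

# Hypothetical sample of the target column.
df = pd.DataFrame(
    {"booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"]}
)

# Explicit mapping keeps the encoding visible and easy to audit.
status_map = {"Not_Canceled": 1, "Canceled": 0}
df["booking_status"] = df["booking_status"].map(status_map)

print(df["booking_status"].tolist())
```

Note that with this encoding the positive class (1) is "not canceled", so scikit-learn metrics such as recall and precision describe non-cancellations by default unless `pos_label` is changed.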

EDA

Summary of EDA

Data Description:

Observations from EDA:

Checking Multicollinearity
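Multicollinearity is usually checked with the variance inflation factor (VIF), where VIF_j = 1 / (1 - R_j^2) and R_j^2 comes from regressing feature j on the other features. The sketch below computes it with NumPy as the diagonal of the inverse correlation matrix, which is an equivalent formulation; it is not necessarily the routine used in this notebook. The helper `vif_table` and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column via the inverse correlation matrix."""
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")


# Synthetic demo: "a_plus_noise" is almost a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame(
    {"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=200)}
)

vif = vif_table(demo)
print(vif.round(2))
```

A frequently used rule of thumb treats a VIF above roughly 5 to 10 as a sign that a column is largely explained by the others and is a candidate for dropping.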

Building the model

Building a Logistic Regression model

Logistic Regression (with Sklearn library)

Observations:

Logistic Regression (with statsmodels library)

Observations:

ROC-AUC
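A rough sketch of how the ROC curve can be used both to score a model and to suggest a classification threshold. It assumes scikit-learn and a synthetic dataset in place of the bookings data, and it picks the threshold that maximizes TPR minus FPR, which is one common heuristic and may differ from the choice made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of class 1 for each row.
probs = model.predict_proba(X)[:, 1]

# Area under the ROC curve: 0.5 is chance level, 1.0 is perfect ranking.
auc = roc_auc_score(y, probs)

# Candidate thresholds with their false and true positive rates.
fpr, tpr, thresholds = roc_curve(y, probs)

# Pick the threshold where the gap between TPR and FPR is largest.
best_idx = np.argmax(tpr - fpr)
roc_threshold = thresholds[best_idx]

print(f"AUC={auc:.3f}, threshold={roc_threshold:.3f}")
```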

Model Performance Evaluation

Model Performance Improvement

Let's use Precision-Recall curve and see if we can find a better threshold
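As a sketch of this idea, the snippet below computes the precision-recall curve with scikit-learn and takes the threshold where precision and recall are closest to each other. The data is synthetic and the "closest point" rule is an assumption; other criteria, such as maximizing F1, are equally valid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]

# precision and recall each have one more entry than thresholds.
precision, recall, thresholds = precision_recall_curve(y, probs)

# Drop the final (precision=1, recall=0) point so the arrays align.
gap = np.abs(precision[:-1] - recall[:-1])
pr_threshold = thresholds[np.argmin(gap)]

print(f"Balanced precision/recall threshold: {pr_threshold:.3f}")
```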

Let's check the performance on the test set

Using model with default threshold

Using model with threshold=0.76

Using model with threshold=0.58
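Comparing thresholds amounts to converting predicted probabilities into labels at each cut-off and recomputing the metrics. The sketch below uses made-up probabilities and labels; `scores_at_threshold` is a hypothetical helper, and only the threshold values 0.76 and 0.58 come from this notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)


def scores_at_threshold(y_true, probs, threshold=0.5):
    """Label a row 1 when its probability exceeds the threshold, then score."""
    preds = (np.asarray(probs) > threshold).astype(int)
    return {
        "threshold": threshold,
        "accuracy": accuracy_score(y_true, preds),
        "recall": recall_score(y_true, preds),
        "precision": precision_score(y_true, preds),
        "f1": f1_score(y_true, preds),
    }


# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.95, 0.80, 0.70, 0.55, 0.40, 0.90]

results = [scores_at_threshold(y_true, probs, t) for t in (0.5, 0.58, 0.76)]
for row in results:
    print(row)
```

Raising the threshold generally trades recall for precision, which is why the three settings need to be compared side by side on the test set.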

Model performance summary

Final Model Summary

Conclusion:

Building a Decision Tree model

Checking model performance on training set

Checking model performance on test set

Reducing overfitting

Hyperparameter tuning is tricky because there is no direct way to calculate how a change in a hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. Here we'll use grid search, a tuning technique that looks for the best hyperparameter values by exhaustively evaluating every combination in a specified parameter grid. Each combination is scored with cross-validation, and the best-scoring combination is selected.
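A minimal sketch of cross-validated grid search for a decision tree, assuming scikit-learn. The synthetic data, the parameter grid and the F1 scoring choice are all illustrative assumptions and are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared bookings data.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Every combination in this grid is tried: 3 x 3 = 9 candidates.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring="f1",  # assumed metric; pick the one that matches the business goal
    cv=5,          # 5-fold cross-validation on the training set only
)
grid.fit(X_train, y_train)

# The best candidate is refit on the full training set by default.
best_tree = grid.best_estimator_
print(grid.best_params_, round(grid.best_score_, 3))
print("Test accuracy:", round(best_tree.score(X_test, y_test), 3))
```

Grid search only finds the best combination among the values listed, so the grid itself should cover a sensible range. Limiting `max_depth` and raising `min_samples_leaf` are typical ways to curb an overfit tree.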

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree
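Besides a plotted diagram, a fitted tree can be inspected as plain-text rules. The sketch below uses scikit-learn's `export_text` on a small tree trained on synthetic data. The feature names are placeholders, not the bookings columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the prepared bookings data.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

# Each indented line is a split; leaves show the predicted class.
tree_rules = export_text(tree, feature_names=feature_names)
print(tree_rules)
```

Reading the rules from the top shows which features the tree splits on first, which is often a quick indication of the most influential predictors.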

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Model Performance Comparison and Conclusions (Comparing all the decision tree models)

Conclusions:

The model can be used to predict whether a customer will cancel a booking, based on the following factors:

Actionable Insights and Recommendations